Abstract

Learning a discriminative classifier from unlabeled data has been proven to be an effective way
of simultaneously clustering the data and training a classifier from the data. We present a novel
measure for evaluating the quality of a data partition by the misclassification rate of the nearest
neighbor classifier learnt from that partition. By our unsupervised training scheme, we show the
close relationship between the misclassification rate of the unsupervised nearest neighbor classifier
and the widely used normalized graph Laplacian produced from the isotropic kernel, and the
misclassification rate reduces to the well-known kernel form of graph cut assuming a uniform distribution.
By minimizing the bound for the misclassification rate we also derive a new clustering algorithm,
i.e. the Normalized Harmonic Cut (NHC). We show that NHC is equivalent to normalized cut
in the case of a uniform distribution, and our analysis provides a unified view of clustering and
classification by evaluating the misclassification rate of the learnt classifier. Experimental results and
comparisons with other clustering methods on real data sets demonstrate the effectiveness of the
Normalized Harmonic Cut, which also shows the potential of our generalization analysis.

Keywords: Unsupervised Nearest Neighbor Classifier, Generalization Analysis 



1. Introduction 

Clustering methods partition the data into a set of self-similar clusters. Representative clustering
methods include K-means (Hartigan and Wong, 1979), which minimizes the within-cluster
dissimilarities, spectral clustering (Ng et al., 2001), which identifies clusters of more complex shapes lying
on some low dimensional manifolds, and statistical modeling methods (Fraley and Raftery, 2002), which
approximate the data by a mixture of parametric distributions. Although such clustering methods
achieve promising results in practice, they ignore the inherent connection between the obtained
clusters and the classes from a supervised learning perspective. As a result, they not only lose a chance of
potentially obtaining better data partitions, but also cannot utilize the effective tools from the supervised
learning community to produce data clusters which are also suitable for further classification.

Viewing clusters as classes, recent works on unsupervised classification manage to learn a
classifier from unlabeled data, and they have established the connection between clustering and
multi-class classification from a supervised learning perspective. Xu et al. (2004) learn a max-margin
two-class classifier in an unsupervised manner. Also, (Agakov and Barber, 2005) and (Gomes et al.,
2010) learn the kernelized Gaussian classifier and the kernel logistic regression classifier
respectively. Both (Gomes et al., 2010) and (Bridle et al., 1991) adopt the entropy of the posterior
distribution of the class label by the classifier to measure the quality of the learnt classifier, and the
parameters of such unsupervised classifiers can be computed by continuous optimization. More
recent work presented in (Sugiyama et al., 2011) learns an unsupervised classifier by maximizing the
mutual information between cluster labels and the data, and the Squared-Loss Mutual Information
is employed to produce a convex optimization problem.

However, few methods consider the misclassification rate of the learnt classifier, one of the
most important performance measures for classifiers, so that the performance of the unsupervised
classifier is not fully evaluated by many methods. Although Bengio et al. (2003)
analyzed the out-of-sample error for unsupervised learning algorithms, their method focused on
lower-dimensional embedding of the data points and did not train a classifier from unlabeled data.
In contrast, we learn the unsupervised nearest neighbor classifier (1-NN) from the data by our
unsupervised training scheme introduced in the next section, and analyze its misclassification rate
on the $\delta$-cover of the data and on the entire space. By minimizing the misclassification rate of 1-NN on
the entire space, we design a new clustering method called Normalized Harmonic Cut (NHC). Our
generalization analysis shows that the misclassification rate of 1-NN is closely related to the widely
used normalized graph Laplacian produced by the isotropic kernel, which in fact bounds the
misclassification rate of 1-NN, and that this bound is equivalent to the well-known graph cut (or
unnormalized graph Laplacian) when the data points are sampled from a uniform distribution.
Following this analysis it is shown that the NHC is equivalent to normalized cut (Shi and Malik, 2000)
or Laplacian eigenmaps when the data points are mapped to a one dimensional discrete label space in
the case of a uniform distribution.
Experimental results show the effectiveness of NHC over other clustering methods.

Although the generalization property of 1-NN has been extensively studied since (Cover and Hart,
1967), to the best of our knowledge most analysis focuses on the case where the training data are
random. In this work, we propose to measure the quality of a specific data partition by the
misclassification rate of 1-NN learnt from that partition, so we derive the misclassification rate of 1-NN
with fixed training data (details in the next section).

The rest of this paper is organized as follows. We first introduce the formulation of
unsupervised 1-NN classification in Section 2, then derive the misclassification rate of 1-NN and the
Normalized Harmonic Cut algorithm in Section 3. After that, we demonstrate and analyze the
experimental results, and finally conclude the paper.



2. Formulation of Unsupervised Nearest Neighbor Classification 

Before formulating our clustering method, we introduce the notation used in the formulation of
unsupervised classification by the nearest neighbor classifier. Suppose we are given the data set
$X = \{x_l\}_{l=1}^{N} \subset \mathbb{R}^D$; the goal of clustering is to find the cluster assignments
$Y = (y_1, y_2, \ldots, y_N)$ to the data, where $y_l$ is the cluster label for $x_l$, $y_l \in \{1, 2, \ldots, Q\}$
and $Q$ is the number of clusters. We model clustering as a data partition problem, and the clustering
algorithm partitions $X$ into $Q$ disjoint clusters $\mathcal{C} = \{C_i\}_{i=1}^{Q}$, with $X = \bigcup_{i=1}^{Q} C_i$.
2.1 Unsupervised Training Scheme

With any hypothetical data partition $\mathcal{C}$, we can build the corresponding training data set
$S_{\mathcal{C}} = \{(C_i, i)\}_{i=1}^{Q}$, where $i$ is the class label for class $C_i$, for a potential classifier. Note that
such an unsupervised training process is combinatorial in nature, and the number of data partitions (the
Bell number $B_N$) is prohibitively large when $Q$ is unknown. In this way, the quality of a data
partition can be evaluated by the performance of the classifier learnt from that partition. Since the
training error of 1-NN is always zero, the misclassification rate of the nearest neighbor classifier
associated with a data partition is considered instead, and we prefer the data partition with minimal
associated misclassification rate.
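
To give a sense of the combinatorial growth mentioned above, the following minimal sketch counts the partitions of N points with the Bell triangle. The function name `bell_number` is an illustrative choice, not part of the paper.

```python
def bell_number(n: int) -> int:
    """Number of ways to partition a set of n items (Bell number B_n)."""
    row = [1]  # first row of the Bell triangle
    for _ in range(n):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            # Each further entry adds the entry above-left to its left neighbour.
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, bell_number(n))
```

Even for a few dozen points the count is far too large to enumerate, which is why the partition has to be found by optimization rather than by exhaustive search.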



It is worthwhile to mention that previous unsupervised classification methods (Sugiyama et al.,
2011; Gomes et al., 2010) circumvent the above combinatorial unsupervised training scheme by
learning a probabilistic classifier from the whole data, so they cannot evaluate the classification
performance of the learnt classifier. Xu et al. (2004) learn the max-margin classifier by the
combinatorial training scheme and minimize its misclassification rate; however, they cannot handle
the case where the number of classes $Q$ is unknown. In contrast, starting from the analysis of the
misclassification rate of 1-NN, we derive a novel density-based cut function which only involves
pairwise interactions between data points and does not depend on $Q$. Our new measure can be
equipped either with a normalization step to handle a fixed number of classes, resulting in a new
Normalized Harmonic Cut algorithm superior to traditional normalized cut in the experimental results,
or with the exemplar-based clustering scheme (Frey and Dueck, 2007) to automatically determine the
number of clusters by model selection.

2.2 The misclassification rate 

Suppose $(X, Y)$ are random variables indicating an unobserved data point and its class label
respectively, $(X, Y) \in \mathcal{X} \times \mathcal{Y}$, and $\mathcal{D}$ is the distribution over $\mathcal{X} \times \mathcal{Y}$. We then denote by
$E_{S_{\mathcal{C}}}$ the misclassification rate of the 1-NN learnt from the training data $S_{\mathcal{C}}$ with a loss function $L$
(Bishop, 2006):

$$E_{S_{\mathcal{C}}} = \mathbb{E}_{(X,Y) \sim \mathcal{D},\, X \in \mathcal{X}} \left[ L\left(Y, NN_{S_{\mathcal{C}}}(X)\right) \mid S_{\mathcal{C}} \right] \tag{1}$$

$$L(i, j) = \begin{cases} 1 & \text{if } i \neq j \\ 0 & \text{otherwise} \end{cases}$$

Also, the misclassification rate constrained on $R_X \subseteq \mathcal{X}$ is defined accordingly:

$$E_{S_{\mathcal{C}}, R_X} \triangleq \mathbb{E}_{(X,Y) \sim \mathcal{D},\, X \in R_X} \left[ L\left(Y, NN_{S_{\mathcal{C}}}(X)\right) \mid S_{\mathcal{C}} \right] \tag{2}$$

We prefer the data partition with minimal associated misclassification rate $E_{S_{\mathcal{C}}}$. In our analysis
$\mathcal{Y} = \{1, 2, \ldots, Q\}$, $\mathcal{X}$ is bounded by $[-M_0, M_0]^D$, and $\mathcal{D}_X$ is the induced marginal distribution over
$\mathcal{X}$. We suppose that the data $X = \{x_l\}_{l=1}^{N}$ are sampled i.i.d. from $\mathcal{D}_X$, and $NN_S(X)$ is the
classification function which returns the class label of a sample $X$ by the 1-NN rule learnt from $S$.
We also let $f$ be the density function of $\mathcal{D}_X$, $\eta_i(x)$ be the posterior distribution of $X = x$ over the
labels, i.e. $\eta_i(x) = P[Y = i \mid X = x]$, and $(f_j, \pi_j)$ be the density function and prior for class $j$. It
is further assumed that $f$ and $\{\eta_i\}_{i=1}^{Q}$ satisfy the following two conditions:
A) $f$ is bounded, i.e. $f_{\min} \le f \le f_{\max}$.
B) Both $f$ and $\{\eta_i\}_{i=1}^{Q}$ are Lipschitz: $|f(x) - f(y)| \le c\,\|x - y\|$, $|\eta_i(x) - \eta_i(y)| \le c_i\,\|x - y\|$,
where $c$ and $\{c_i\}_{i=1}^{Q}$ are the corresponding Lipschitz constants.
In the following text we abbreviate $S_{\mathcal{C}}$ to $S$.
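
As a concrete reading of the training set built from a partition and of the 0-1 loss in (1), the sketch below forms the labeled pairs from a hypothetical partition, applies the 1-NN rule, and averages the loss over a few labeled test points. The helper names and the toy data are assumptions made for illustration, and the empirical average only stands in for the expectation over the unknown distribution.

```python
import math


def training_set(clusters):
    """Build labeled pairs (x, i) from a partition given as a list of clusters."""
    return [(x, i) for i, cluster in enumerate(clusters, start=1) for x in cluster]


def nn_classify(train, x):
    """1-NN rule: return the label of the training point closest to x."""
    return min(train, key=lambda pair: math.dist(pair[0], x))[1]


def zero_one_loss(i, j):
    """L(i, j) = 1 if the labels differ, 0 otherwise."""
    return 1 if i != j else 0


def empirical_error(train, test):
    """Average 0-1 loss of the 1-NN rule over labeled test pairs (x, y)."""
    return sum(zero_one_loss(y, nn_classify(train, x)) for x, y in test) / len(test)


if __name__ == "__main__":
    S = training_set([[(0.0, 0.0)], [(1.0, 1.0)]])  # hypothetical two-cluster partition
    print(empirical_error(S, [((0.1, 0.0), 1), ((0.9, 1.0), 1)]))
```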

3. Clustering by Unsupervised Nearest Neighbor Classification 
3.1 Generalization Analysis for the 1-NN Trained with Given Data 

We start with the generalization analysis of the 1-NN constrained on the $\delta$-cover of the data $X$, since
the decision regions under the 1-NN rule can be easily constructed on the $\delta$-cover. It is interesting
to observe that the constrained misclassification error of 1-NN is closely related to the graph cut by the
isotropic kernel.

Definition 1 The $\delta$-cover of the data $X$ is defined as $B_\delta = \{B(x_m, \delta)\}_{m=1}^{N}$, where $B(x_m, \delta)$ is a
ball centered at $x_m$ with radius $\delta > 0$, and $B(x_l, \delta) \cap B(x_m, \delta) = \emptyset$ for any $l \neq m$.

Suppose $\mathcal{R}_{i,S}$ is the decision region for class $i$ under the 1-NN rule trained from $S$; then it can be
verified that $\bigcup_{x_m \in C_i} B(x_m, \delta) \subseteq \mathcal{R}_{i,S}$. Since the $\delta$-cover is comprised of subsets of the decision
regions, namely $\bigcup_{i=1}^{Q} \bigcup_{x_m \in C_i} B(x_m, \delta) = B_\delta$, we can derive the generalization properties of the 1-NN
on the $\delta$-cover $B_\delta$ by Theorem 2. Because we will estimate the underlying probability density
functions (e.g. $f$ and $f_j$) frequently in the following text, we introduce the non-parametric kernel
density estimators for $f_j$ and $f$ as below:

$$\hat{f}_j(x) = \frac{1}{|C_j|} \sum_{x_l \in C_j} K_h(x - x_l), \qquad \hat{f}(x) = \frac{1}{N} \sum_{l=1}^{N} K_h(x - x_l) \tag{3}$$

where

$$K_h(x) = \frac{1}{(2\pi h^2)^{D/2}}\, e^{-\frac{\|x\|^2}{2h^2}} \tag{4}$$

is the isotropic Gaussian kernel with bandwidth $h$ (Silverman, 1986), and we use $\hat{\pi}_j = \frac{|C_j|}{N}$ as an
estimator for $\pi_j$. Extensive study proves that the kernel density estimator uniformly converges
to the underlying density almost surely, as summarized in (Wied and Weißbach, 2010).
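
The estimators in (3) can be sketched in a few lines. The code below is a minimal illustration under the assumption that points are plain tuples of floats and that a partition is a list of point lists; the function names are hypothetical. It also shows why the class prior is estimated by the cluster fraction: with that choice the class-wise estimates mix back into the overall density estimate.

```python
import math


def gaussian_kernel(u, h):
    """Isotropic Gaussian kernel K_h(u) in D = len(u) dimensions."""
    dim = len(u)
    sq_norm = sum(c * c for c in u)
    return math.exp(-sq_norm / (2 * h * h)) / ((2 * math.pi * h * h) ** (dim / 2))


def kde(x, points, h):
    """Average of K_h(x - x_l) over the given points."""
    return sum(gaussian_kernel([a - b for a, b in zip(x, p)], h) for p in points) / len(points)


def class_estimates(x, clusters, h):
    """Return [(pi_hat_j, f_hat_j(x))] for every cluster C_j of the partition."""
    n = sum(len(c) for c in clusters)
    return [(len(c) / n, kde(x, c, h)) for c in clusters]


if __name__ == "__main__":
    clusters = [[(0.0, 0.0), (1.0, 0.0)], [(5.0, 5.0)]]
    x = (0.5, 0.5)
    mixture = sum(pi * f for pi, f in class_estimates(x, clusters, h=1.0))
    overall = kde(x, [p for c in clusters for p in c], h=1.0)
    print(mixture, overall)  # the two values agree up to rounding
```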



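Theorem 2 1. The misclassification rate (2) of the 1-NN rule constrained on the $\delta$-cover of the
data $X$ in terms of the class-conditional density functions and priors is

$$E_{S,B_\delta} = \sum_{i=1}^{Q} \sum_{j \neq i} \int_{\bigcup_{x_m \in C_i} B(x_m, \delta)} f_j(x)\, \pi_j \, dx \tag{5}$$

2. The estimator for $E_{S,B_\delta}$ using $\{\hat{f}_j, \hat{\pi}_j\}_{j=1}^{Q}$ is given by

$$\hat{E}_{S,B_\delta} = \frac{1}{N} \sum_{l,m=1}^{N} \theta_{lm} \int_{B(x_m, \delta)} K_h(x - x_l)\, dx \tag{6}$$

and

$$\left| \frac{\hat{E}_{S,B_\delta}}{\delta^D} - \frac{c_D}{N} \sum_{l,m=1}^{N} \theta_{lm} K_h(x_m - x_l) \right| \le \frac{c_D}{N} \sum_{l,m=1}^{N} \theta_{lm} M \delta^2 \tag{7}$$

where $M = \dfrac{[\ldots]}{(2\pi)^{D/2}\, h^{D+2}\, (D+2)}$, $c_D$ is the volume of the unit ball in $\mathbb{R}^D$, and $\theta_{lm}$ is a class
indicator function such that $\theta_{lm} = 1$ if $x_l$, $x_m$ belong to different classes in $S$ and $\theta_{lm} = 0$ otherwise.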
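Proof 1.

$$\begin{aligned}
E_{S,B_\delta} &= \mathbb{E}_{X \in B_\delta} \left[ \mathbb{E}_Y \left[ L\left(Y, NN_S(X)\right) \mid S, X \right] \right] \\
&= \mathbb{E}_{X \in B_\delta} \left[ P\left[ Y \neq NN_S(X) \mid S, X \right] \right] \\
&= \sum_{i=1}^{Q} \mathbb{E}_{X \in \bigcup_{x_m \in C_i} B(x_m, \delta)} \left[ P\left[ Y \neq i \mid X \right] \right] \\
&= \sum_{i=1}^{Q} \sum_{j \neq i} \int_{\bigcup_{x_m \in C_i} B(x_m, \delta)} f_j(x)\, \pi_j \, dx
\end{aligned}$$

2. It can be proved by expanding the kernel function by the second-order Taylor series.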



The function $\sum_{l,m=1}^{N} \theta_{lm} K_h(x_m - x_l)$ in (7) is actually the graph cut function produced by the
isotropic kernel, which has been widely used in clustering and segmentation since (Wu and Leahy, 1993)
and (Weiss, 1999), and we can see from (7) that this graph cut function bounds the constrained
misclassification rate $\hat{E}_{S,B_\delta}$ (normalized by $\delta^D$). Also,

$$\lim_{\delta \to 0} \frac{\hat{E}_{S,B_\delta}}{\delta^D} = \frac{c_D}{N} \sum_{l,m=1}^{N} \theta_{lm} K_h(x_m - x_l)$$



Therefore, minimizing the above cut function also enforces the minimization of $\hat{E}_{S,B_\delta}$ normalized
by $\delta^D$ when $\delta$ is small enough. We regard this as a connection between unsupervised min-cut and
the bound for the misclassification rate of 1-NN. Since a small $\delta$ undermines the influence of a
non-uniform distribution on the misclassification rate, and the misclassification rate is originally defined
on the entire space, we extend our analysis and consider the misclassification rate on the entire space
$\mathcal{X}$ to further exploit the generalization ability of 1-NN.
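
The graph cut function discussed above is simply a sum of kernel values over pairs of points carrying different labels. The following minimal sketch evaluates it for a given labeling; the function names and the toy points are illustrative assumptions.

```python
import math


def gaussian_kernel(u, h):
    """Isotropic Gaussian kernel K_h(u) in D = len(u) dimensions."""
    dim = len(u)
    sq_norm = sum(c * c for c in u)
    return math.exp(-sq_norm / (2 * h * h)) / ((2 * math.pi * h * h) ** (dim / 2))


def graph_cut(points, labels, h):
    """Sum of K_h(x_m - x_l) over ordered pairs (l, m) with different labels."""
    total = 0.0
    for xl, yl in zip(points, labels):
        for xm, ym in zip(points, labels):
            if yl != ym:  # theta_lm = 1 only across classes
                total += gaussian_kernel([a - b for a, b in zip(xm, xl)], h)
    return total


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (10.0,)]
    print(graph_cut(pts, [1, 1, 2], h=1.0))  # cuts only far-apart pairs: small
    print(graph_cut(pts, [1, 2, 2], h=1.0))  # cuts a close pair: much larger
```

A labeling that separates nearby points pays a large kernel value for each such pair, which is the sense in which a small cut corresponds to a boundary running between well-separated groups.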

Since the 1-NN rule makes a hard decision for a given datum, we introduce the following soft NN cost
function which converges to the 1-NN classification function, and which is similar to the one adopted
by Neighbourhood Components Analysis (Goldberger et al., 2004):

Definition 3 The soft 1-NN cost function is defined as

$$NN_{h^*,S}(x, i) = \frac{\sum_{l=1}^{N} K_{h^*}(x - x_l)\, \mathbb{1}_{C_i}(x_l)}{\sum_{l=1}^{N} K_{h^*}(x - x_l)} \tag{8}$$

where $NN_{h^*,S}(x, i)$ represents the probability that the datum $x$ is assigned to class $i$ by the 1-NN
learnt from $S$, and $\mathbb{1}$ is an indicator function.
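
A minimal sketch of the soft 1-NN cost function (8) is given below. It assumes labeled training pairs `(point, label)`; the name `soft_nn` and the shift by the smallest squared distance are implementation choices made here to avoid numerical underflow for small bandwidths (the shift and the kernel's normalizing constant cancel in the ratio).

```python
import math


def soft_nn(x, train, h_star, classes):
    """Return {class: NN_{h*,S}(x, class)} for labeled pairs train = [(point, label)]."""
    sq_dists = [math.dist(x, p) ** 2 for p, _ in train]
    base = min(sq_dists)  # shifting exponents leaves the ratio in (8) unchanged
    weights = [math.exp(-(s - base) / (2 * h_star ** 2)) for s in sq_dists]
    total = sum(weights)
    return {
        c: sum(w for w, (_, label) in zip(weights, train) if label == c) / total
        for c in classes
    }


if __name__ == "__main__":
    S = [((0.0,), 1), ((1.0,), 2)]
    print(soft_nn((0.2,), S, h_star=100.0, classes=[1, 2]))  # nearly uniform
    print(soft_nn((0.2,), S, h_star=0.05, classes=[1, 2]))   # close to the hard 1-NN decision
```

As the bandwidth shrinks, almost all weight moves to the nearest training point, which is the convergence to the hard 1-NN rule used in the analysis.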

Then we have

Theorem 4 The misclassification rate of the 1-NN trained with fixed training data $S$ is given by

$$E_S = \lim_{h^* \to 0} E_{S,h^*} \tag{9}$$

where

$$E_{S,h^*} = \sum_{i,j=1,\ldots,Q,\, i \neq j} \mathbb{E}_X \left[ \eta_i(X)\, NN_{h^*,S}(X, j) \right] \tag{10}$$

Theorem 4 explicitly gives the expression for the misclassification rate of the 1-NN trained with
fixed training data $S$. In order to apply it in real problems, it is particularly important to derive
the bound for the misclassification rate (9). Theorem 5 shows that, with a large probability, (10) is
bounded by a new cut function:

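Theorem 5 With probability greater than $1 - 2Re^{-M_{h^*}}$, the misclassification rate of the 1-NN
evaluated at $h^*$, i.e. $E_{S,h^*}$, satisfies:

$$\frac{1}{N} \sum_{i \neq j} \sum_{l=1}^{N} \int_{\mathcal{X}} \frac{\pi_i f_i(x)\, K_{h^*}(x - x_l)\, \mathbb{1}_{C_j}(x_l)}{f(x) + \tilde{\varepsilon}}\, dx \;\le\; E_{S,h^*} \;\le\; \frac{1}{N} \sum_{i \neq j} \sum_{l=1}^{N} \int_{\mathcal{X}} \frac{\pi_i f_i(x)\, K_{h^*}(x - x_l)\, \mathbb{1}_{C_j}(x_l)}{f(x) - \tilde{\varepsilon}}\, dx \tag{11}$$

where $M_{h^*} = 2N(2\pi)^D h^{*2D} \varepsilon^2$, $\tilde{\varepsilon} = \left(\tau^{(1)}_{h^*} + c\right)\sqrt{D}\,\tau + \varepsilon + \tau^{(2)}_{h^*}$, $\tau, \tau^{(2)}_{h^*}, \varepsilon > 0$,
$\tau^{(1)}_{h^*} = \dfrac{1}{e^{1/2} (2\pi)^{D/2} h^{*D+1}}$, $\lim_{h^* \to 0} \tau^{(2)}_{h^*} = 0$, and $h^*, \tau, \varepsilon$ are small enough such that $\tilde{\varepsilon} < f_{\min}$. $c$ is the Lipschitz
constant for the density function $f$.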

Theorem 5 provides bounds on the misclassification rate evaluated at $h^*$, i.e. $E_{S,h^*}$. By
Theorem 4 we further analyze the asymptotic case when $h^* \to 0$ in the following theorem, where a
sequence $\{h^*_N\}_{N=1}^{\infty}$ converging to zero is constructed and the limit for the associated error sequence
$\{E_{S,h^*_N}\}_{N=1}^{\infty}$ is bounded.

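Theorem 6 Let $\{h^*_N\}_{N=1}^{\infty}$ be a sequence such that $\lim_{N \to \infty} h^*_N = 0$ and $h^*_N \ge N^{-d}$ with $d < \frac{1}{2D}$.
When $N \to \infty$, then with probability 1,

$$\frac{1}{N} E^{lower} \le E_{S,h^*_N} \le \frac{1}{N} E^{upper} \tag{12}$$

$$E^{upper} = \sum_{i \neq j} \sum_{l=1}^{N} \frac{\pi_i f_i(x_l)\, \mathbb{1}_{C_j}(x_l)}{f(x_l) - \lambda_0 \tau_0 - \varepsilon} \tag{13}$$

where $\lambda_0 > 0$ is a constant, $\tau_0, \varepsilon > 0$ are such that $\lambda_0 \tau_0 + \varepsilon < f_{\min}$, and $E^{lower}$ is obtained from
$E^{upper}$ by replacing the denominator with $f(x_l) + \lambda_0 \tau_0 + \varepsilon$.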
By Theorem 6, the bound for the misclassification rate of 1-NN is faithfully obtained when
the number of training samples goes to infinity. The asymptotic analysis presented by (12) can
also be reasonably applied to estimate the misclassification rate of 1-NN trained with finite training
samples, especially in the case of large training data. Similar to Theorem 2, we use the kernel density
estimators $\{\hat{f}_j, \hat{\pi}_j\}_{j=1}^{Q}$ by (3) to estimate the underlying class-conditional density functions and
class priors, which results in the estimator for the upper bound in (12) as below (we neglect the
constant factor $1/N$ here):

$$\hat{E}^{upper} = \sum_{i \neq j} \sum_{l=1}^{N} \frac{\hat{\pi}_i \hat{f}_i(x_l)\, \mathbb{1}_{C_j}(x_l)}{\hat{f}(x_l) - \lambda_0 \tau_0 - \varepsilon}
= \sum_{i \neq j} \sum_{l=1}^{N} \frac{\frac{1}{N} \sum_{m=1}^{N} K_h(x_l - x_m)\, \mathbb{1}_{C_i}(x_m)\, \mathbb{1}_{C_j}(x_l)}{\frac{1}{N} \sum_{k=1}^{N} K_h(x_l - x_k) - \lambda_0 \tau_0 - \varepsilon} \tag{14}$$

$$= 2 \sum_{l<m} \theta_{lm} H_{lm} \tag{15}$$

$$H_{lm} = \frac{K_h(x_l - x_m)}{\mathrm{Hmean}\left(KS_l, KS_m\right)}$$

(15) is a cut function where $\mathrm{Hmean}(\cdot, \cdot)$ indicates the harmonic mean of two numbers, and $KS_t$
is defined as $KS_t = \sum_{k=1}^{N} K_h(x_t - x_k) - N(\lambda_0 \tau_0 + \varepsilon)$ for any $x_t \in X$. In order to maximize the
probability with which (11) holds, we prefer larger $\tau$ (also $\tau_0$) and larger $\varepsilon$, so that $\lambda_0 \tau_0 + \varepsilon$ should
be maximized. Since $\frac{1}{N} K_h(0) \le \frac{1}{N} \sum_{k=1}^{N} K_h(x_t - x_k)$ for every $t$ (the lower bound is tight when $h$
is small enough) and the inequality $\lambda_0 \tau_0 + \varepsilon < \frac{1}{N} \sum_{k=1}^{N} K_h(x_t - x_k)$ should hold for any $1 \le t \le N$,
we take $\lambda_0 \tau_0 + \varepsilon = \frac{1}{N} K_h(0)$, and the resultant cut is

$$H_{lm} = \frac{K_h(x_l - x_m)}{\mathrm{Hmean}\left( \sum_{k=1, k \neq l}^{N} K_h(x_l - x_k),\; \sum_{k=1, k \neq m}^{N} K_h(x_m - x_k) \right)} \tag{16}$$

We call (16) the harmonic cut since it involves the harmonic mean of two kernel sums. The
harmonic cut function (15) is the kernel density estimator of the asymptotic bound for the
misclassification rate of the 1-NN, and we need to minimize it following our clustering criteria in Section 2.
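
The harmonic cut (16) and the cut value (15) can be sketched directly from their definitions. The code below is an illustration under a few assumptions: points are tuples of floats, the diagonal entries are set to zero because (16) is only needed for distinct points, and the kernel's normalizing constant is dropped since it cancels in the ratio. The function names are hypothetical.

```python
import math


def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers."""
    return 2 * a * b / (a + b)


def harmonic_cut_matrix(points, h):
    """Matrix of H_lm: kernel value divided by the harmonic mean of two kernel sums."""
    n = len(points)
    # Unnormalized Gaussian kernel values; the constant factor cancels in H_lm.
    K = [[math.exp(-math.dist(p, q) ** 2 / (2 * h * h)) for q in points] for p in points]
    # Kernel sums that leave out the point itself (requires at least two points
    # and a bandwidth large enough that these sums do not underflow to zero).
    s = [sum(K[l][k] for k in range(n) if k != l) for l in range(n)]
    return [
        [K[l][m] / harmonic_mean(s[l], s[m]) if l != m else 0.0 for m in range(n)]
        for l in range(n)
    ]


def harmonic_cut(points, labels, h):
    """Cut value 2 * sum over l < m of theta_lm * H_lm for a given labeling."""
    H = harmonic_cut_matrix(points, h)
    n = len(points)
    return 2 * sum(
        H[l][m] for l in range(n) for m in range(l + 1, n) if labels[l] != labels[m]
    )


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (6.0, 6.0), (7.0, 6.0)]
    print(harmonic_cut(pts, [1, 1, 1, 2, 2], h=1.5))  # natural split: small cut
    print(harmonic_cut(pts, [1, 2, 1, 2, 1], h=1.5))  # mixed split: larger cut
```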

3.2 Minimizing the Misclassification Rate by NHC 

Based on the preceding analysis, we design a clustering algorithm which minimizes the asymptotic
bound for the misclassification rate of the 1-NN, so the objective function for our clustering
algorithm in terms of the kernel density estimator is actually the harmonic cut function (15). Written in
matrix form, (15) is equivalent to $\mathrm{Tr}\left(Y^\top \tilde{L} Y\right)$ where $Y = [Y_1\; Y_2 \ldots Y_Q]$ and each $Y_i$ is a column
vector of length $N$ such that $Y_{ki} = 1$ if $x_k \in C_i$ and $Y_{ki} = 0$ otherwise. We define $\tilde{W} = [\tilde{w}_{ij}]_{N \times N}$
where $\tilde{w}_{ij} = H_{ij}$ by (16), the diagonal matrix $\tilde{D}$ where $\tilde{D}_{ii} = \sum_{j=1}^{N} \tilde{w}_{ij}$, and $\tilde{L} = \tilde{D} - \tilde{W}$. It can
be verified that

$$2 \sum_{l<m} \theta_{lm} H_{lm} = \mathrm{Tr}\left(Y^\top \tilde{L} Y\right)$$

Therefore, the formulation of our clustering algorithm is as below (we relax $Y$ to take real values):

$$\min_{Y^\top D Y = I} \mathrm{Tr}\left(Y^\top \tilde{L} Y\right) \tag{17}$$

where $D$ is a diagonal matrix such that $D_{ii} = \sum_{j=1}^{N} w_{ij}$ is the sum of the elements of the $i$-th row of
$W = [w_{ij}]$, and $w_{ij} = K_h(x_i - x_j)$ by the isotropic kernel $K_h$. Similar to the Laplacian eigenmaps
(Belkin and Niyogi, 2003), the constraint $Y^\top D Y = I$ removes the scaling factor when minimizing
(17) with respect to $Y$. Note that $\tilde{W} = D^{-1} W + \left(D^{-1} W\right)^\top$, which can be represented in terms of the
normalized graph Laplacian (footnote 1). Moreover, if $\mathcal{D}_X$ is a uniform distribution, $f$ is a constant and hence
we do not need $\hat{f}$ as an estimator for $f$ in (15), and $\tilde{W} = W$ (up to a constant factor), which is the
graph cut function produced by the isotropic kernel. In this case, we can see that (17) is equivalent
to normalized cut (Shi and Malik, 2000).

Following the method for the Rayleigh quotient (Golub and Van Loan, 1996), (17) is minimized
by solving the generalized eigenvalue system shown in Algorithm 1. It should be emphasized that
unlike normalized cut, where the isotropic kernel matrix $W = [K_h(x_l - x_m)]$ is used to measure
the affinity between data points, we employ the harmonic cut $\tilde{W} = [H_{lm}]$ as the affinity measure.
Note that the implementation of Algorithm 1 is quite similar to that of normalized cut, and the
optimization is performed efficiently by solving a generalized eigenvector problem.

Algorithm 1 Normalized Harmonic Cut (NHC)
1: $X$: the data set for clustering, $Q$: the number of clusters;
2: Calculate the harmonic cut matrix $\tilde{W} = [\tilde{w}_{ij}]_{N \times N}$ by (16) where $\tilde{w}_{ij} = H_{ij}$, and the diagonal
   matrix $\tilde{D}$ where $\tilde{D}_{ii} = \sum_{j=1}^{N} \tilde{w}_{ij}$.
3: Compute the similarity matrix $W = [w_{ij}]_{N \times N}$ where $w_{ij} = K_h(x_i - x_j)$, and the diagonal
   matrix $D$ where $D_{ii} = \sum_{j=1}^{N} w_{ij}$.
4: Compute the unnormalized Laplacian matrix $\tilde{L} = \tilde{D} - \tilde{W}$.
5: Compute the first $Q$ generalized eigenvectors for $\tilde{L} t = \lambda D t$.
6: Denote the obtained $Q$ eigenvectors by $Y = [Y_1\; Y_2 \ldots Y_Q]$, perform $k$-means clustering on the
   rows of $Y$. Suppose the cluster label for the $i$-th row is $y_i$, $1 \le i \le N$.
7: Assign the cluster label $y_i$ to data point $x_i$, $1 \le i \le N$.
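
The steps of Algorithm 1 can be sketched for the special case of two clusters using only the standard library. This is a simplified illustration and not the authors' implementation: it assumes Q = 2, replaces the k-means step by thresholding the sign of the second generalized eigenvector, and finds that eigenvector by shifted power iteration with deflation instead of a library eigensolver. The function name `nhc_two_clusters` and the iteration count are hypothetical.

```python
import math


def nhc_two_clusters(points, h, iters=500):
    """Two-cluster NHC sketch: second generalized eigenvector of Lt t = lambda D t."""
    n = len(points)
    # Step 3: isotropic Gaussian affinities W (constant factor dropped, it only
    # rescales eigenvalues) and their row sums D.
    W = [[math.exp(-math.dist(p, q) ** 2 / (2 * h * h)) for q in points] for p in points]
    d = [sum(row) for row in W]
    # Step 2: harmonic cut affinities (16); K / Hmean(a, b) = K * (1/a + 1/b) / 2.
    s = [d[l] - W[l][l] for l in range(n)]  # kernel sums without the self term
    Wt = [[0.5 * W[l][m] * (1 / s[l] + 1 / s[m]) if l != m else 0.0
           for m in range(n)] for l in range(n)]
    dt = [sum(row) for row in Wt]
    # Step 4: Laplacian of the harmonic cut graph.
    Lt = [[(dt[l] if l == m else 0.0) - Wt[l][m] for m in range(n)] for l in range(n)]
    # Step 5: Lt t = lambda D t is equivalent to A v = lambda v with
    # A = D^{-1/2} Lt D^{-1/2} and v = D^{1/2} t.
    A = [[Lt[l][m] / math.sqrt(d[l] * d[m]) for m in range(n)] for l in range(n)]
    shift = 2 * max(dt[l] / d[l] for l in range(n))  # Gershgorin bound on the top eigenvalue
    trivial = [math.sqrt(x) for x in d]              # eigenvector of A for lambda = 0
    norm = math.sqrt(sum(x * x for x in trivial))
    trivial = [x / norm for x in trivial]
    v = [math.sin(l + 1.0) for l in range(n)]        # deterministic starting vector
    for _ in range(iters):
        # Power iteration on (shift * I - A), projecting out the trivial eigenvector.
        v = [shift * v[l] - sum(A[l][m] * v[m] for m in range(n)) for l in range(n)]
        dot = sum(a * b for a, b in zip(v, trivial))
        v = [a - dot * b for a, b in zip(v, trivial)]
        nv = math.sqrt(sum(a * a for a in v))
        v = [a / nv for a in v]
    t = [v[l] / math.sqrt(d[l]) for l in range(n)]
    # Steps 6-7, simplified: the sign of the eigenvector entry gives the cluster label.
    return [1 if x > 0 else 2 for x in t]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    print(nhc_two_clusters(pts, h=2.0))
```

For more than two clusters, or for data sets of realistic size, a dense generalized symmetric eigensolver followed by k-means on the rows of the eigenvector matrix would be the natural replacement for the power iteration and sign threshold used here.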



4. Connection to Existing Methods 

Derived from the bound for the misclassification rate of the 1-NN, the harmonic cut exhibits
interesting similarity to existing cut-based clustering methods. The graph cut function
$\sum_{l,m=1}^{N} \theta_{lm} K_h(x_m - x_l)$ has been extensively used for clustering and segmentation (Wu and Leahy, 1993;
Weiss, 1999; Shi and Malik, 2000; Ng et al., 2001; Luxburg, 2007), and it also bounds the misclassification rate
of 1-NN on the $\delta$-cover and on the entire space in the case of a uniform distribution. Such a graph cut
function easily produces imbalanced data partitions, and normalized cut (Shi and Malik, 2000) solves
this problem by introducing a normalization factor in the cut function which controls the cluster size
and obtains more desirable results. It should be emphasized that all these methods heavily rely on
the pairwise similarity or affinity matrix $W$ which encodes the affinity between data points, and two
data points are more likely to be assigned to different clusters if the affinity between them is low. In
contrast to the cut function by the isotropic Gaussian kernel, the harmonic cut function used in NHC is
built upon the generalization analysis of unsupervised 1-NN classification, which shows advantages
over other cut-based clustering methods on real data sets. As mentioned in the previous section,
normalized cut becomes a special case of NHC assuming a uniform distribution. Moreover, the harmonic cut
(16) is similar to the row-normalized kernel for building the diffusion map (Coifman et al., 2005;
Coifman and Lafon, 2006), which is used to account for the influence of a non-uniform distribution
on the affinity function. Similar to normalized cut (Shi and Malik, 2000) or spectral clustering
(Ng et al., 2001), NHC is efficient since it only requires solving an eigenproblem, as shown in
Algorithm 1.

1. The normalized graph Laplacian is $L = I - D^{-1} W$, and $\tilde{W}$ is related to $L$ by $\tilde{W} = 2I - L - L^\top$.

NHC is also closely related to the cut function in (Narayanan et al., 2006). In that work, the
data $X$ lies on a domain $\Omega \subset \mathbb{R}^D$. Suppose $p$ is the probability density function on $\Omega$ and $S$ is the
boundary which separates $\Omega$ into two parts. Let $X_1$ and $X_2$ be the partition of $X$ by $S$. Then the
authors provide an asymptotic analysis proving that the following cut function

$$\sum_{x_l \in X_1} \sum_{x_m \in X_2} V_{lm} \tag{18}$$

converges to the volume of the class boundary $S$, i.e. $\int_S p(s)\, ds$, where $\mathrm{Gmean}(\cdot, \cdot)$ is the
geometric mean of two numbers and $V$ is defined as

$$V_{lm} = \frac{K_{h_N}(x_l - x_m)}{\mathrm{Gmean}\left( \sum_{k \neq l} K_{h_N}(x_l - x_k),\; \sum_{k \neq m} K_{h_N}(x_m - x_k) \right)} \tag{19}$$

Compared to the harmonic cut (16), we observe that $H_{lm} \ge V_{lm}$ when $h = h_N$ by the Cauchy-Schwarz
inequality, so that the harmonic cut is the tight upper bound for the cut (19) given a fixed kernel
bandwidth, and the latter eventually converges to $\int_S p(s)\, ds$ (after multiplication by a normalization
term). According to the widely accepted Low Density Separation assumption stating that the class
boundary tends to pass through regions of low density, a low $\int_S p(s)\, ds$ is preferred. Minimizing the
harmonic cut actually follows this principle.



5. Experimental Results 

While our main contribution focuses on the derivation of the generalization ability for unsupervised
1-NN classification, we derive the NHC algorithm based on our generalization analysis and show
its performance on real data sets in this section.
Since the formulation of NHC is derived from the graph cut perspective, the NHC algorithm falls
into the category of clustering methods based on cut, and we compare NHC to Normalized Cut (NC)
(Shi and Malik, 2000) and K-means. NHC is also compared to spectral clustering (SC) (Ng et al.,
2001), since Shi's NC and Ng's SC are actually two versions of normalized spectral clustering.

We use the popular adjusted Rand index (ARI) (Hubert and Arabie, 1985) for evaluating the
performance of the clustering methods. ARI is the adjusted-for-chance version of the Rand index, and it
has been widely used as a measure of agreement between the inferred cluster labels and the ground
truth cluster assignments. ARI is bounded above by 1, takes values near 0 for chance-level agreement
and can be negative when the agreement is below chance; it achieves the maximum 1 when the
inferred labels are identical to the ground truth. A higher ARI indicates a better agreement between
the inferred data partition and the ground truth partition.
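
For reference, the adjusted Rand index can be computed from the contingency counts of the two labelings. The sketch below is a minimal standard-library version; the function name and the convention of returning 1.0 in the degenerate case where the denominator vanishes are choices made here, and the input is assumed to contain at least two points.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same points (n >= 2)."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))  # contingency table entries
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    sum_ij = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in true_counts.values())
    sum_b = sum(comb(c, 2) for c in pred_counts.values())
    expected = sum_a * sum_b / comb(n, 2)  # index expected under chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (sum_ij - expected) / (max_index - expected)


if __name__ == "__main__":
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, renamed labels
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # below-chance agreement
```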

The input data set is centered and its variance is normalized in each dimension (to unit
variance) before we feed it into the clustering methods. We apply K-means, Spectral Clustering,
Normalized Cut and Normalized Harmonic Cut to three data sets from the UCI repository (Vertebral
Column, Breast Tissue, SPECT Heart). We choose the kernel bandwidth $h$ in the kernel density estimator (16) as
$h = \alpha\, \mathrm{Dist}_{\max}$, where $\mathrm{Dist}_{\max}$ is the maximum squared distance between data points, and $\alpha$ is a
bandwidth ratio. In our experiments we let $\alpha$ range from 0.01 to 0.2 to see the performance change
of the various clustering methods. For a fair comparison SC, NC, and NHC share the same kernel
bandwidth $h$. Since all four clustering methods involve random initialization in the K-means step,
we run them 50 times and take the average. The clustering results are shown in Figures 1 and 2.
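
The preprocessing and bandwidth rule described above can be sketched as follows. The use of the population standard deviation and the function names are assumptions made for illustration; columns with zero variance are not handled.

```python
import math
import statistics


def standardize(data):
    """Center each dimension and scale it to unit variance (population std assumed)."""
    columns = list(zip(*data))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # must be non-zero
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in data]


def bandwidth(data, alpha):
    """h = alpha * Dist_max, with Dist_max the largest squared pairwise distance."""
    dist_max = max(math.dist(p, q) ** 2 for p in data for q in data)
    return alpha * dist_max


if __name__ == "__main__":
    z = standardize([(0.0, 0.0), (2.0, 4.0)])
    for alpha in (0.01, 0.1, 0.2):
        print(alpha, bandwidth(z, alpha))
```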
Figure 1: Clustering on UCI Vertebral Column and Breast Tissue Data Sets

As expected from our theoretical analysis, NHC always performs better than NC, and it also
frequently renders better results than SC. We also observe that NHC is stable with varying kernel
bandwidth.



6. Conclusion 

Learning a classifier from unlabeled data is promising for both clustering and classification. We
learn the nearest neighbor classifier from the data in an unsupervised manner, and analyze its
misclassification rate on the $\delta$-cover of the data and on the entire space. Our generalization analysis
shows the close relationship between the widely used graph Laplacian produced by the isotropic kernel
and the misclassification rate of 1-NN. We further obtain the asymptotic bound for the
misclassification rate of 1-NN by kernel density estimation. A new clustering method, Normalized Harmonic
Cut, is derived by minimizing this asymptotic misclassification bound. The effectiveness of NHC is
evidenced by experiments on real UCI data sets.

Figure 2: Clustering on UCI SPECT Heart Data Set



Appendix 

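Proof [Proof of Theorem 4] It can be verified that

$$\lim_{h^* \to 0} NN_{h^*,S}(x, i) = P\left[ NN_S(X) = i \mid X = x, S \right] \tag{20}$$

Similar to the proof of Theorem 2, we have

$$\begin{aligned}
E_S &= \mathbb{E}_{(X,Y)} \left[ L\left(Y, NN_S(X)\right) \mid S \right] \\
&= \mathbb{E}_X \left[ \mathbb{E}_Y \left[ L\left(Y, NN_S(X)\right) \mid S, X \right] \right] \\
&= \mathbb{E}_X \left[ P\left[ Y \neq NN_S(X) \mid S, X \right] \right] \\
&= \sum_{i,j=1,\ldots,Q,\, i \neq j} \mathbb{E}_X \left[ P\left[ Y = i \mid X \right] P\left[ NN_S(X) = j \mid X, S \right] \right] \\
&= \sum_{i,j=1,\ldots,Q,\, i \neq j} \mathbb{E}_X \left[ \eta_i(X)\, P\left[ NN_S(X) = j \mid X, S \right] \right]
\end{aligned} \tag{21}$$

since $Y$ and $NN_S(X)$ are conditionally independent given $X$. Also, the product $\eta_i(X)\, NN_{h^*,S}(X, j) \le 1$;
based on (20) and the dominated convergence theorem, we get

$$\lim_{h^* \to 0} \mathbb{E}_X \left[ \eta_i(X)\, NN_{h^*,S}(X, j) \right]
= \mathbb{E}_X \left[ \lim_{h^* \to 0} \eta_i(X)\, NN_{h^*,S}(X, j) \right]
= \mathbb{E}_X \left[ \eta_i(X)\, P\left[ NN_S(X) = j \mid S, X \right] \right] \tag{22}$$

Substituting (22) into (21), we finish the proof. ■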
Before the proof of Theorem 5, we introduce the following two lemmas:

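Lemma 7 Suppose $\mathcal{D}$ is a probabilistic distribution with density function $f$ which satisfies the two
conditions A and B in Section 2, then we have

$$\forall z \in \mathcal{X}, \quad \lim_{h^* \to 0} \int_{\mathcal{X}} f(x)\, K_{h^*}(x - z)\, dx = f(z) \tag{23}$$

where $K$ is defined as a Gaussian kernel $K_{h^*}(x) = \frac{1}{(2\pi h^{*2})^{D/2}}\, e^{-\frac{\|x\|^2}{2h^{*2}}}$, and this convergence is
uniform.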

Lemma 8 The function

$$T(x) = \sum_{l=1}^{N} K_{h^*}(x - x_l), \quad x_l \in X \tag{24}$$

is uniformly continuous, i.e. $|T(x) - T(y)| \le T_{h^*} \|x - y\|$ for any $x, y \in \mathcal{X}$, where
$T_{h^*} = \dfrac{N}{e^{1/2} (2\pi)^{D/2} h^{*D+1}}$.

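Proof [Proof of Theorem 5] Let $\{P_1, P_2, \ldots, P_R\}$ be the $\tau$-cover of the set $\mathcal{X}$. Suppose $R$ points
$\{x_r\}_{r=1}^{R}$ are chosen from $\mathcal{X}$ and $x_r \in P_r$. For each $1 \le r \le R$, according to Hoeffding's
inequality,

$$\Pr\left[ \left| \frac{T(x_r)}{N} - \mathbb{E}_Z\left[ K_{h^*}(x_r - Z) \right] \right| > \varepsilon \right] \le 2e^{-M_{h^*}} \tag{25}$$

where $T$ is defined in (24) and $\mathbb{E}_Z\left[ K_{h^*}(x_r - Z) \right] = \int_{\mathcal{X}} f(z)\, K_{h^*}(x_r - z)\, dz$. By the union
bound, the probability that the above event happens for some point of $\{x_r\}_{r=1}^{R}$ is less than $2Re^{-M_{h^*}}$. It follows
that with probability greater than $1 - 2Re^{-M_{h^*}}$, where $M_{h^*} = 2N(2\pi)^D h^{*2D} \varepsilon^2$,

$$\left| \frac{T(x_r)}{N} - \mathbb{E}_Z\left[ K_{h^*}(x_r - Z) \right] \right| \le \varepsilon \tag{26}$$

holds for any $1 \le r \le R$.

For any $x \in \mathcal{X}$, $x \in P_r$ for some $P_r \in P$ where $P$ is the $\tau$-cover of $\mathcal{X}$. By Lemma 8,

$$\frac{1}{N} \left| T(x) - T(x_r) \right| \le \frac{T_{h^*}}{N} \|x - x_r\| \le \frac{T_{h^*}}{N} \sqrt{D}\, \tau = \tau^{(1)}_{h^*} \sqrt{D}\, \tau \tag{27}$$

Also, since $f$ is $c$-Lipschitz,

$$|f(x) - f(x_r)| \le c\, \|x - x_r\| \le c \sqrt{D}\, \tau \tag{28}$$

Moreover, by Lemma 7, $\lim_{h^* \to 0} \mathbb{E}_Z\left[ K_{h^*}(x_r - Z) \right] = f(x_r)$ holds for any $1 \le r \le R$ and this
convergence is uniform, so there exists $\tau^{(2)}_{h^*}$ such that

$$\left| \mathbb{E}_Z\left[ K_{h^*}(x_r - Z) \right] - f(x_r) \right| \le \tau^{(2)}_{h^*}, \qquad \lim_{h^* \to 0} \tau^{(2)}_{h^*} = 0 \tag{29}$$

Combining (26), (27), (28) and (29) we get

$$\left| \frac{T(x)}{N} - f(x) \right| \le \left( \tau^{(1)}_{h^*} + c \right) \sqrt{D}\, \tau + \varepsilon + \tau^{(2)}_{h^*} \tag{30}$$

for any $x \in \mathcal{X}$. It is worthwhile to mention that we can always choose $h^*, \tau, \varepsilon$ which are small
enough so that $\tilde{\varepsilon} < f_{\min}$ (e.g. choose $\varepsilon < \frac{f_{\min}}{3}$ first, then choose $h^*$ small enough such that
$\tau^{(2)}_{h^*} < \frac{f_{\min}}{3}$, and choose $\tau$ small enough to make $\left( \tau^{(1)}_{h^*} + c \right) \sqrt{D}\, \tau < \frac{f_{\min}}{3}$). Note that

$$\begin{aligned}
E_{S,h^*} &= \sum_{i,j=1,\ldots,Q,\, i \neq j} \mathbb{E}_X \left[ \eta_i(X)\, NN_{h^*,S}(X, j) \right] \\
&= \sum_{i,j=1,\ldots,Q,\, i \neq j} \sum_{l=1}^{N} \int_{\mathcal{X}} \frac{\eta_i(x)\, K_{h^*}(x - x_l)\, \mathbb{1}_{C_j}(x_l)}{T(x)}\, f(x)\, dx \\
&= \sum_{i,j=1,\ldots,Q,\, i \neq j} \sum_{l=1}^{N} \int_{\mathcal{X}} \frac{\pi_i f_i(x)\, K_{h^*}(x - x_l)\, \mathbb{1}_{C_j}(x_l)}{T(x)}\, dx
\end{aligned} \tag{31}$$

Then (11) follows by applying the inequality (30) to (31), when $h^*, \tau, \varepsilon$ are small enough such that
$\tilde{\varepsilon} < f_{\min}$. ■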

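Proof [Proof of Theorem 6] By Theorem 5, with probability at least $1 - 2Re^{-M_{h^*}}$,

$$E_{S,h^*} \le \frac{1}{N} \sum_{i \neq j} \sum_{l=1}^{N} \int_{\mathcal{X}} \frac{\pi_i f_i(x)\, K_{h^*}(x - x_l)\, \mathbb{1}_{C_j}(x_l)}{f(x) - \tilde{\varepsilon}}\, dx \tag{32}$$

when $\tilde{\varepsilon} < f_{\min}$, where $\tilde{\varepsilon} = \left( \tau^{(1)}_{h^*} + c \right) \sqrt{D}\, \tau + \varepsilon + \tau^{(2)}_{h^*}$. Here we choose $\tau = \tau_0 N^{-d(D+1)}$. It
can be obtained that

$$\tilde{\varepsilon} \le \lambda_0 \tau_0 + \varepsilon + \tau^{(2)}_{h^*_N} + c\, \tau_0 \sqrt{D}\, N^{-d(D+1)} \tag{33}$$

where $\lambda_0 = \dfrac{\sqrt{D}}{e^{1/2} (2\pi)^{D/2}}$ and $\tau_0$ is chosen such that $\lambda_0 \tau_0 + \varepsilon + \tau^{(2)}_{h^*_N} + c\, \tau_0 \sqrt{D}\, N^{-d(D+1)} < f_{\min}$
for sufficiently large $N$. Note that $\lim_{N \to \infty} \tau^{(2)}_{h^*_N} = \lim_{N \to \infty} c\, \tau_0 \sqrt{D}\, N^{-d(D+1)} = 0$; after applying
Lemma 7 we get

$$\lim_{N \to \infty} N E_{S,h^*_N} \le \sum_{i \neq j} \sum_{l=1}^{N} \frac{\pi_i f_i(x_l)\, \mathbb{1}_{C_j}(x_l)}{f(x_l) - \lambda_0 \tau_0 - \varepsilon} \tag{34}$$

Following a similar argument,

$$\lim_{N \to \infty} N E_{S,h^*_N} \ge \sum_{i \neq j} \sum_{l=1}^{N} \frac{\pi_i f_i(x_l)\, \mathbb{1}_{C_j}(x_l)}{f(x_l) + \lambda_0 \tau_0 + \varepsilon} \tag{35}$$

Also, since $R$ is the number of elements in the $\tau$-cover of $\mathcal{X}$, $R \le \left( \frac{2M_0}{\tau} \right)^D$. With the choice of $\tau$ and $h^*_N$,

$$\lim_{N \to \infty} R e^{-M_{h^*_N}} \le \lim_{N \to \infty} \frac{(2M_0)^D N^{dD(D+1)}}{\tau_0^D}\, e^{-2N^{1-2Dd} (2\pi)^D \varepsilon^2} = 0 \tag{36}$$

Therefore, when $N \to \infty$, (12) holds with probability 1. ■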